Abstract

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has not been explored, mostly due to a lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected dataset and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.¹

1 Introduction 

Generating captions to describe natural images is a fundamental research problem at the intersection of computer vision and natural language processing. Single image captioning (Mori et al., 1999; Farhadi et al., 2010; Kulkarni et al., 2011) has many practical applications such as text-based image search, photo curation, assisting visually impaired people, image understanding in social media, etc. This task has drawn significant attention in the research community with numerous studies (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018), and recent state-of-the-art methods have achieved promising results on large captioning datasets, such as MS COCO (Lin et al., 2014). Besides single image captioning, the community has also explored other visual captioning problems such as video captioning (Venugopalan et al., 2015; Xu et al., 2016) and referring expressions (Kazemzadeh et al., 2014; Yu et al., 2017). However, the problem of two-image captioning, especially the task of describing the relationships and differences between two images, is still underexplored. In this paper, we focus on advancing research on this challenging problem by introducing a new dataset and proposing novel neural relational-speaker models.

¹ Our data and code are publicly available at: https://github.com/airsplay/VisualRelationships

[Figure 1 image: an input image pair is passed to the Relational Speaker, which outputs "Remove the people from the picture."]

Figure 1: An example result of our method showing the input image pair from our Image Editing Request dataset, and the output instruction predicted by our relational speaker model trained on the dataset.

To the best of our knowledge, the dataset of Jhamtani and Berg-Kirkpatrick (2018) is the only public dataset aimed at generating natural language descriptions for two real images. This dataset is about 'spotting the difference', and hence focuses more on describing exhaustive differences by learning alignments between multiple text descriptions and multiple image regions; the differences are therefore intended to be explicitly identifiable by subtracting two images. There are many other tasks that require more diverse, detailed and implicit relationships between two images. Interpreting image editing effects with instructions is a suitable task for this purpose, because it requires exploiting visual transformations and is widely used in real life, for example in explaining complex image editing effects to laypersons or visually impaired users, image edit or tutorial retrieval, and language-guided image editing systems. We first build a new language-guided image editing dataset with high-quality annotations by (1) crawling image pairs from real image editing request websites, (2) annotating editing instructions via Amazon Mechanical Turk, and (3) refining the annotations through experts.

Next, we propose a new neural speaker model for generating sentences that describe the visual relationship between a pair of images. Our model is general and not dependent on any specific dataset. Starting from an attentive encoder-decoder baseline, we first develop a model enhanced with two attention-based neural components, a static relational attention and a sequential multi-head attention, to address these two challenges, respectively. We further extend it by designing a dynamic relational attention module to combine the advantages of these two components, which finds the relationship between two images while decoding. The computation of dynamic relational attention is mathematically equivalent to attention over all visual "relationships". Thus, our method provides a direct way to model visual relationships in language.

To show the effectiveness of our models, we evaluate them on three datasets: our new dataset, the "Spot-the-Diff" dataset (Jhamtani and Berg-Kirkpatrick, 2018), and the two-image visual reasoning NLVR2 dataset (Suhr et al., 2019) (adapted for our task). We train models separately on each dataset with the same hyper-parameters and evaluate them on the same test set across all methods. Experimental results demonstrate that our model outperforms all the baselines and existing methods. The main contributions of our paper are: (1) We create a novel human language guided image editing dataset to boost the study of describing visual relationships; (2) We design novel relational-speaker models, including a dynamic relational attention module, to handle the problem of two-image captioning by focusing on all their visual relationships; (3) Our method is evaluated on several datasets and achieves state-of-the-art results.

2 Datasets 

We present the collection process and statistics of our Image Editing Request dataset and briefly introduce two public datasets (viz., Spot-the-Diff and NLVR2). All three datasets are used to study the task of two-image captioning and to evaluate our relational-speaker models. Examples from these three datasets are shown in Fig. 2.

2.1 Image Editing Request Dataset 

Each instance in our dataset consists of an image 
pair (i.e., a source image and a target image) and a 
corresponding editing instruction which correctly 
and comprehensively describes the transformation 
from the source image to the target image. Our 
collected Image Editing Request dataset will be 
publicly released along with the scripts to unify 
it with the other two datasets. 

2.1.1 Collection Process 

To create a high-quality, diverse dataset, we follow a three-step pipeline: image pair collection, editing instruction annotation, and post-processing by experts (i.e., cleaning and labeling test set annotations).

Image Pairs Collection. We first crawl the editing image pairs (i.e., a source image and a target image) from posts on Reddit (Photoshop request subreddit)² and Zhopped³. Posts generally start with an original image and an editing specification. Other users then send their modified images by replying to the posts. We collect original images and modified images as source images and target images, respectively.

Editing Instruction Annotation. The texts in the original Reddit and Zhopped posts are too noisy to be used as image editing instructions. To address this problem, we collect the image editing instructions on MTurk using an interactive interface that allows the MTurk annotators to either write an image editing instruction corresponding to a displayed image pair, or flag it as invalid (e.g., if the two images have nothing in common).

² https://www.reddit.com/r/photoshoprequest
³ http://zhopped.com



[Figure 2 image panels with their example sentences:
Ours (Image Editing Request): "Add a sword and a cloak to the squirrel."
Spot-the-Diff: "The blue truck is no longer there." "A car is approaching the parking lot from the right."
NLVR2 Captioning: "Each image shows a row of dressed dogs posing with a cat that is also wearing some garment."
NLVR2 Classification: "Each image shows a row of dressed dogs posing with a cat that is also wearing some garment." → True; "In at least one of the images, six dogs are posing for a picture, while on a bench." → False]

Figure 2: Examples from three datasets: our Image Editing Request, Spot-the-Diff, and NLVR2. Each example involves two natural images and an associated sentence describing their relationship. The task of generating NLVR2 captions is converted from its original classification task.



                 B-1   B-2   B-3   B-4   Rouge-L
Ours              52    34    21    13      45
Spot-the-Diff     41    25    15     8      31
MS COCO           38    22    15     8      34

Table 1: Human agreement on our dataset, compared with Spot-the-Diff and MS COCO (captions=3). B-1 to B-4 are BLEU-1 to BLEU-4. Our dataset has the highest human agreement.


Post-Processing by Experts. MTurk annotators are not always experts in image editing. To ensure the quality of the dataset, we hire an image editing expert to label each image editing instruction of the dataset as one of the following four options: 1. correct instruction, 2. incomplete instruction, 3. implicit request, 4. other type of errors. Only the data instances labeled with "correct instruction" are selected to compose our dataset, and are used in training or evaluating our neural speaker model.

Moreover, two additional experts are required to write two more editing instructions (one instruction per expert) for each image pair in the validation and test sets. This process makes the dataset a multi-reference one, which allows various automatic evaluation metrics, such as BLEU, CIDEr, and ROUGE, to more accurately evaluate the quality of generated sentences.


2.1.2 Dataset Statistics 

The Image Editing Request dataset that we have collected and annotated currently contains 3,939 image pairs (3,061 in training, 383 in validation, 495 in test) with 5,695 human-annotated instructions in total. Each image pair in the training set has one instruction, and each image pair in the validation and test sets has three instructions, written by three different annotators. Instructions have an average length of 7.5 words (standard deviation: 4.8). After removing the words with fewer than three occurrences, the dataset has a vocabulary of 786 words. The human agreement of our dataset is shown in Table 1. The word frequencies in our dataset are visualized in Fig. 3. Most of the images in our dataset are realistic. Since the task is image editing, target images may have some artifacts (see Image Editing Request examples in Fig. 2 and Fig. 5).
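The reported totals can be cross-checked from the split sizes and the number of instructions per pair. The short sketch below only re-derives the figures quoted above; the variable names are illustrative.

```python
# Split sizes and instructions per image pair, as reported in Sec. 2.1.2.
splits = {"train": 3061, "val": 383, "test": 495}
refs_per_pair = {"train": 1, "val": 3, "test": 3}

# Total number of image pairs across the three splits.
total_pairs = sum(splits.values())

# Total number of instructions: one per training pair, three per val/test pair.
total_instructions = sum(splits[name] * refs_per_pair[name] for name in splits)

print(total_pairs)         # 3939
print(total_instructions)  # 5695
```

Both values agree with the 3,939 image pairs and 5,695 instructions stated in the text.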

2.2 Existing Public Datasets 

To show the generalization of our speaker model, we also train and evaluate our model on two public datasets, Spot-the-Diff (Jhamtani and Berg-Kirkpatrick, 2018) and NLVR2 (Suhr et al., 2019). Instances in these two datasets are each composed of two natural images and a human-written sentence describing the relationship between the two images. To the best of our knowledge, these are the only two public datasets with a reasonable amount of data that are suitable for our task. We next briefly introduce these two datasets.

Figure 3: Word cloud showing the vocabulary frequencies of our Image Editing Request dataset.

Spot-the-Diff. This dataset is designed to help generate a set of instructions that can comprehensively describe all visual differences. Thus, the dataset contains images from video-surveillance footage, in which differences can be easily found. This is because all the differences could be effectively captured by subtractions between two images, as shown in Fig. 2. The dataset contains 13,192 image pairs, and an average of 1.86 captions are collected for each image pair. The dataset is split into training, validation, and test sets with a ratio of 8:1:1.

NLVR2. The original task of the Cornell Natural Language for Visual Reasoning (NLVR2) dataset is visual sentence classification; see Fig. 2 for an example. Given two related images and a natural language statement as inputs, a learned model needs to determine whether the statement correctly describes the visual contents. We convert this classification task to a generation task by taking only the image pairs with correct descriptions. After conversion, the amount of data is 51,020, which is almost half of the original dataset with a size of 107,296. We also preserve the training, validation, and test split in the original dataset.

3 Relational Speaker Models

In this section, we aim to design a general speaker model that describes the relationship between two images. Due to the different kinds of visual relationships, the meanings of images vary in different tasks: "before" and "after" in Spot-the-Diff, "left" and "right" in NLVR2, "source" and "target" in our Image Editing Request dataset. We use the nomenclature of "source" and "target" for simplification, but our model is general and not designed for any specific dataset. Formally, the model generates a sentence $\{w_1, w_2, \ldots, w_T\}$ describing the relationship between the source image $I^{SRC}$ and the target image $I^{TRG}$. $\{w_t\}_{t=1}^{T}$ are the word tokens with a total length of $T$. $I^{SRC}$ and $I^{TRG}$ are natural images in their raw RGB pixels. In the rest of this section, we first introduce our basic attentive encoder-decoder model, and show how we gradually improve it to fit the task better.


3.1 Basic Model 

Our basic model (Fig. 4(a)) is similar to the baseline model in Jhamtani and Berg-Kirkpatrick (2018), which is adapted from the attentive encoder-decoder model for single image captioning (Xu et al., 2015). We use ResNet-101 (He et al., 2016) as the feature extractor to encode the source image $I^{SRC}$ and the target image $I^{TRG}$. The feature maps of size $N \times N \times 2048$ are extracted, where $N$ is the height or width of the feature map. Each feature in the feature map represents a part of the image. Feature maps are then flattened to two $N^2 \times 2048$ feature sequences $f^{SRC}$ and $f^{TRG}$, which are further concatenated to a single feature sequence $f$.

$$f^{SRC} = \mathrm{ResNet}(I^{SRC}) \tag{1}$$

$$f^{TRG} = \mathrm{ResNet}(I^{TRG}) \tag{2}$$

$$f = \left[ f^{SRC}_1, \ldots, f^{SRC}_{N^2}, f^{TRG}_1, \ldots, f^{TRG}_{N^2} \right] \tag{3}$$

At each decoding step $t$, the LSTM cell takes the embedding of the previous word $w_{t-1}$ as an input. The word $w_{t-1}$ either comes from the ground truth (in training) or is the token with maximal probability (in evaluation). The attention module then attends to the feature sequence $f$ with the hidden output $h_t$ as a query. Inside the attention module, it first computes the alignment scores $\alpha_{t,i}$ between the query $h_t$ and each $f_i$. Next, the feature sequence $f$ is aggregated with a weighted average (with a weight of $\alpha$) to form the image context $\hat{f}_t$. Lastly, the context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a fully-connected layer:

$$\tilde{w}_{t-1} = \mathrm{embedding}(w_{t-1}) \tag{4}$$

$$h_t, c_t = \mathrm{LSTM}(\tilde{w}_{t-1}, h_{t-1}, c_{t-1}) \tag{5}$$

$$\alpha_{t,i} = \mathrm{softmax}_i\left( h_t^{\top} W_{IMG} f_i \right) \tag{6}$$

$$\hat{f}_t = \sum_i \alpha_{t,i} f_i \tag{7}$$

$$\hat{h}_t = \tanh\left( W_1 [\hat{f}_t; h_t] + b_1 \right) \tag{8}$$

[Figure 4 panels: (a) Basic Model, (b) Multi-Head Attention, (c) Static Relational Attention, (d) Dynamic Relational Attention]

Figure 4: The evolution diagram of our models to describe the visual relationships. One decoding step at t is shown. The linear layers are omitted for clarity. The basic model (a) is an attentive encoder-decoder model, which is enhanced by the multi-head attention (b) and static relational attention (c). Our best model (d) dynamically computes the relational scores in decoding to avoid losing relationship information.

The probability of generating the $k$-th word token at time step $t$ is a softmax over a linear transformation of the attentive hidden vector $\hat{h}_t$. The loss $\mathcal{L}_t$ is the negative log likelihood of the ground truth word token $w_t^{*}$:

$$p_t(w_{t,k}) = \mathrm{softmax}_k\left( W_W \hat{h}_t + b_W \right) \tag{9}$$

$$\mathcal{L}_t = -\log p_t(w_t^{*}) \tag{10}$$
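To make the attention step of Eqs. 6-8 concrete, the following NumPy sketch runs one step on random toy tensors. It is an illustration under assumed shapes, not the released implementation: the function name, the toy dimensions, and the random weights are all hypothetical, and the LSTM and word prediction layers are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

def basic_attention_step(h_t, f, W_img, W_1, b_1):
    """One attention step of the basic model.

    h_t: (H,) LSTM hidden output, f: (2*N*N, D) concatenated features,
    W_img: (H, D), W_1: (H, D + H), b_1: (H,).
    """
    scores = f @ (W_img.T @ h_t)      # h_t^T W_IMG f_i for every i
    alpha = softmax(scores)           # Eq. 6
    f_hat = alpha @ f                 # Eq. 7: weighted average of features
    h_hat = np.tanh(W_1 @ np.concatenate([f_hat, h_t]) + b_1)  # Eq. 8
    return alpha, f_hat, h_hat

rng = np.random.default_rng(0)
N, D, H = 3, 8, 4  # toy sizes (assumed); the paper uses D = 2048 and H = 512
f_src = rng.normal(size=(N * N, D))
f_trg = rng.normal(size=(N * N, D))
f = np.concatenate([f_src, f_trg], axis=0)  # Eq. 3
h_t = rng.normal(size=H)
W_img = rng.normal(size=(H, D))
W_1 = rng.normal(size=(H, D + H))
b_1 = np.zeros(H)

alpha, f_hat, h_hat = basic_attention_step(h_t, f, W_img, W_1, b_1)
print(alpha.shape, f_hat.shape, h_hat.shape)
```

Because the source and target features are concatenated before attention, a single weight vector `alpha` is shared across both images, which is the limitation that the next subsection addresses.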

3.2 Sequential Multi-Head Attention 

One weakness of the basic model is that the plain attention module simply takes the concatenated image feature $f$ as the input, which does not differentiate between the two images. We thus consider applying a multi-head attention module (Vaswani et al., 2017) to handle this (Fig. 4(b)). Instead of using the simultaneous multi-head attention⁴ in Transformer (Vaswani et al., 2017), we implement the multi-head attention in a sequential way. This way, when the model is attending to the target image, the contextual information retrieved from the source image is available, and the model can therefore perform better at differentiation or relationship learning.

⁴ We also tried the original multi-head attention but it is empirically weaker than our sequential multi-head attention.

In detail, the source attention head first attends to the flattened source image feature $f^{SRC}$. The attention module is built in the same way as in Sec. 3.1, except that it now only attends to the source image:

$$\alpha^{SRC}_{t,i} = \mathrm{softmax}_i\left( h_t^{\top} W_{SRC} f^{SRC}_i \right) \tag{11}$$

$$\hat{f}^{SRC}_t = \sum_i \alpha^{SRC}_{t,i} f^{SRC}_i \tag{12}$$

$$\hat{h}^{SRC}_t = \tanh\left( W_2 [\hat{f}^{SRC}_t; h_t] + b_2 \right) \tag{13}$$

The target attention head then takes the output of the source attention $\hat{h}^{SRC}_t$ as a query to retrieve appropriate information from the target feature $f^{TRG}$:

$$\alpha^{TRG}_{t,j} = \mathrm{softmax}_j\left( (\hat{h}^{SRC}_t)^{\top} W_{TRG} f^{TRG}_j \right) \tag{14}$$

$$\hat{f}^{TRG}_t = \sum_j \alpha^{TRG}_{t,j} f^{TRG}_j \tag{15}$$

$$\hat{h}^{TRG}_t = \tanh\left( W_3 [\hat{f}^{TRG}_t; \hat{h}^{SRC}_t] + b_3 \right) \tag{16}$$

In place of $\hat{h}_t$, the output of the target head $\hat{h}^{TRG}_t$ is used to predict the next word.⁵
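The chaining of the two heads in Eqs. 11-16 can be sketched in NumPy as follows. This is a minimal illustration with assumed toy shapes and random weights; the helper names `attend` and `sequential_heads` and the parameter dictionary are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

def attend(query, feats, W_att, W_out, b_out):
    """Single attention head: score, average, then merge with the query."""
    alpha = softmax(feats @ (W_att.T @ query))   # Eq. 11 / Eq. 14
    context = alpha @ feats                      # Eq. 12 / Eq. 15
    merged = np.tanh(W_out @ np.concatenate([context, query]) + b_out)  # Eq. 13 / Eq. 16
    return merged, alpha

def sequential_heads(h_t, f_src, f_trg, p):
    # The source head is queried by the LSTM hidden output h_t.
    h_src, a_src = attend(h_t, f_src, p["W_src"], p["W_2"], p["b_2"])
    # The target head is queried by the source head's output, not by h_t.
    h_trg, a_trg = attend(h_src, f_trg, p["W_trg"], p["W_3"], p["b_3"])
    return h_trg, a_src, a_trg

rng = np.random.default_rng(0)
M, D, H = 9, 8, 4  # toy sizes (assumed); M = N^2 features per image
params = {
    "W_src": rng.normal(size=(H, D)),
    "W_2": rng.normal(size=(H, D + H)),
    "b_2": np.zeros(H),
    "W_trg": rng.normal(size=(H, D)),
    "W_3": rng.normal(size=(H, D + H)),
    "b_3": np.zeros(H),
}
h_trg, a_src, a_trg = sequential_heads(
    rng.normal(size=H), rng.normal(size=(M, D)), rng.normal(size=(M, D)), params
)
print(h_trg.shape, a_src.shape, a_trg.shape)
```

The only structural difference from a parallel two-head layout is the query of the second call: it receives `h_src`, so information retrieved from the source image conditions what is read from the target image.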

3.3 Static Relational Attention 

Although the sequential multi-head attention model can learn to differentiate the two images, visual relationships are not explicitly examined. We thus allow the model to statically (i.e., not in decoding) compute the relational score between source and target feature sequences and reduce the scores into two relationship-aware feature sequences. We apply a bi-directional relational attention (Fig. 4(c)) for this purpose: one from the source to the target, and one from the target to the source. For each feature in the source feature sequence, the source-to-target attention computes its alignment with the features in the target feature sequence. The source feature, the attended target feature, and the difference between them are then merged together with a fully-connected layer:

$$\alpha^{S \to T}_{i,j} = \mathrm{softmax}_j\left( (W_S f^{SRC}_i)^{\top} (W_T f^{TRG}_j) \right) \tag{17}$$

$$\hat{f}^{S \to T}_i = \sum_j \alpha^{S \to T}_{i,j} f^{TRG}_j \tag{18}$$

$$\hat{f}^{S}_i = \tanh\left( W_4 [f^{SRC}_i; \hat{f}^{S \to T}_i] + b_4 \right) \tag{19}$$

We decompose the attention weight into two small matrices $W_S$ and $W_T$ so as to reduce the number of parameters, because the dimension of the image feature is usually large. The target-to-source cross-attention is built in the opposite way: it takes each target feature $f^{TRG}_j$ as a query, attends to the source feature sequence, and gets the attentive feature $\hat{f}^{T}_j$. We then use these two bidirectional attentive sequences $\hat{f}^{S}_i$ and $\hat{f}^{T}_j$ in the multi-head attention module (shown in the previous subsection) at each decoding step.
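A NumPy sketch of one direction of the static relational attention (Eqs. 17-19) is given below, applied once per direction. Shapes, names, and weights are assumptions for illustration. In particular, whether the two directions share the output layer is not stated in this section; the sketch reuses `W_4` for brevity.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # row-wise stable softmax
    return e / e.sum(axis=1, keepdims=True)

def relational_attention(f_query, f_other, W_query, W_other, W_4, b_4):
    """One direction of static relational attention.

    f_query, f_other: (M, D); W_query, W_other: (R, D) with R much smaller
    than D; W_4: (H, 2 * D); b_4: (H,).
    """
    # Low-rank score (W_query f_i)^T (W_other f_j) for all pairs (i, j).
    scores = (f_query @ W_query.T) @ (f_other @ W_other.T).T   # (M, M)
    alpha = softmax_rows(scores)                               # Eq. 17
    attended = alpha @ f_other                                 # Eq. 18
    merged = np.tanh(np.concatenate([f_query, attended], axis=1) @ W_4.T + b_4)  # Eq. 19
    return merged, alpha

rng = np.random.default_rng(0)
M, D, R, H = 9, 8, 3, 4  # toy sizes (assumed)
f_src = rng.normal(size=(M, D))
f_trg = rng.normal(size=(M, D))
W_s = rng.normal(size=(R, D))
W_t = rng.normal(size=(R, D))
W_4 = rng.normal(size=(H, 2 * D))
b_4 = np.zeros(H)

# Source-to-target, then target-to-source with the roles swapped.
f_hat_s, alpha_st = relational_attention(f_src, f_trg, W_s, W_t, W_4, b_4)
f_hat_t, alpha_ts = relational_attention(f_trg, f_src, W_t, W_s, W_4, b_4)
print(f_hat_s.shape, f_hat_t.shape)
```

The factorization into `W_s` and `W_t` replaces a full D x D bilinear weight with two R x D projections, which is the parameter saving mentioned in the text.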

3.4 Dynamic Relational Attention 

The static relational attention module compresses pairwise relationships (of size $N^4$) into two relationship-aware feature sequences (of size $2 \times N^2$). The compression saves computational resources but has a potential drawback of information loss, as discussed in Bahdanau et al. (2015) and Xu et al. (2015). In order to avoid losing information, we modify the static relational attention module to its dynamic version, which calculates the relational scores while decoding (Fig. 4(d)).

At each decoding step $t$, the dynamic relational attention calculates the alignment score $a_{t,i,j}$ between three vectors: a source feature $f^{SRC}_i$, a target feature $f^{TRG}_j$, and the hidden state $h_t$. Since the dot-product used in previous attention modules does not have a direct extension for three vectors, we extend the dot product and use it to compute the three-vector alignment score.

$$\mathrm{dot}(x, y) = \sum_d x_d y_d = x^{\top} y \tag{20}$$

$$\mathrm{dot}^{*}(x, y, z) = \sum_d x_d y_d z_d = (x \odot y)^{\top} z \tag{21}$$

$$a_{t,i,j} = \mathrm{dot}^{*}\left( W_{SK} f^{SRC}_i, W_{TK} f^{TRG}_j, W_{HK} h_t \right) \tag{22}$$

$$\phantom{a_{t,i,j}} = \left( W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j \right)^{\top} W_{HK} h_t \tag{23}$$

where $\odot$ is the element-wise multiplication.

The alignment scores (of size $N^4$) are normalized by softmax. The attention information is then fused into the attentive hidden vector $\hat{h}^{D}_t$ as before.

$$\alpha_{t,i,j} = \mathrm{softmax}_{i,j}\left( a_{t,i,j} \right) \tag{24}$$

$$\hat{f}^{SRC\text{-}D}_t = \sum_{i,j} \alpha_{t,i,j} f^{SRC}_i \tag{25}$$

$$\hat{f}^{TRG\text{-}D}_t = \sum_{i,j} \alpha_{t,i,j} f^{TRG}_j \tag{26}$$

$$\hat{h}^{D}_t = \tanh\left( W_5 [\hat{f}^{SRC\text{-}D}_t; \hat{f}^{TRG\text{-}D}_t; h_t] + b_5 \right) \tag{27}$$

$$\phantom{\hat{h}^{D}_t} = \tanh\left( W_{5S} \hat{f}^{SRC\text{-}D}_t + W_{5T} \hat{f}^{TRG\text{-}D}_t + W_{5H} h_t + b_5 \right) \tag{28}$$

where $W_{5S}$, $W_{5T}$, $W_{5H}$ are sub-matrices of $W_5$ and $W_5 = [W_{5S}, W_{5T}, W_{5H}]$.

According to Eqn. 23 and Eqn. 28, we find an analog in conventional attention layers with the following specifications:

• Query: $h_t$

• Key: $W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j$

• Value: $W_{5S} f^{SRC}_i + W_{5T} f^{TRG}_j$

The key $W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j$ and the value $W_{5S} f^{SRC}_i + W_{5T} f^{TRG}_j$ can be considered as representations of the visual relationships between $f^{SRC}_i$ and $f^{TRG}_j$. It is a direct attention to the visual relationship between the source and target images, and hence is suitable for the task of generating relationship descriptions.

⁵ We tried to exchange the order of the two heads or have two orders concurrently. We didn't see any significant difference in results between them.

Method                                      BLEU-4   CIDEr   METEOR   ROUGE-L

Our Dataset (Image Editing Request)
basic model                                   5.04   21.58    11.58     34.66
+multi-head att                               6.13   22.82    11.76     35.13
+static rel-att                               5.76   20.70    12.59     35.46
-static +dynamic rel-att                      6.72   26.36    12.80     37.25

Spot-the-Diff
CAPT (Jhamtani and Berg-Kirkpatrick, 2018)    7.30   26.30    10.50     25.60
DDLA (Jhamtani and Berg-Kirkpatrick, 2018)    8.50   32.80    12.00     28.60
basic model                                   5.68   22.20    10.98     24.21
+multi-head att                               7.52   31.39    11.64     26.96
+static rel-att                               8.31   33.98    12.95     28.26
-static +dynamic rel-att                      8.09   35.25    12.20     31.38

NLVR2
basic model                                   5.04   43.39    10.82     22.19
+multi-head att                               5.11   44.80    10.72     22.60
+static rel-att                               4.95   45.67    10.89     22.69
-static +dynamic rel-att                      5.00   46.41    10.37     22.94

Table 2: Automatic metric of test results on three datasets. Best results of the main metric are marked in bold font. Our full model is the best on all three datasets with the main metric.
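The dynamic relational attention of Sec. 3.4 can be sketched in NumPy as below. The sketch computes the three-vector scores of Eqs. 22-23 with `einsum`, normalizes over all pairs (Eq. 24), and then checks numerically that attending over pairwise "values" gives the same result as the linear terms of Eq. 28. Shapes, names, and random weights are assumptions for illustration only.

```python
import numpy as np

def dynamic_relational_attention(h_t, f_src, f_trg, W_sk, W_tk, W_hk):
    """Scores and contexts for one decoding step.

    h_t: (H,), f_src and f_trg: (M, D), W_sk and W_tk: (R, D), W_hk: (R, H).
    """
    k_src = f_src @ W_sk.T   # (M, R) projected source features
    k_trg = f_trg @ W_tk.T   # (M, R) projected target features
    q = W_hk @ h_t           # (R,) projected hidden state
    # Eqs. 22-23: a[i, j] = sum_r k_src[i, r] * k_trg[j, r] * q[r]
    scores = np.einsum("ir,jr,r->ij", k_src, k_trg, q)
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                       # Eq. 24: softmax over all (i, j)
    f_src_d = alpha.sum(axis=1) @ f_src       # Eq. 25
    f_trg_d = alpha.sum(axis=0) @ f_trg       # Eq. 26
    return alpha, f_src_d, f_trg_d

rng = np.random.default_rng(0)
M, D, R, H = 9, 8, 3, 4  # toy sizes (assumed)
f_src = rng.normal(size=(M, D))
f_trg = rng.normal(size=(M, D))
h_t = rng.normal(size=H)
W_sk = rng.normal(size=(R, D))
W_tk = rng.normal(size=(R, D))
W_hk = rng.normal(size=(R, H))
W_5s = rng.normal(size=(H, D))
W_5t = rng.normal(size=(H, D))

alpha, f_src_d, f_trg_d = dynamic_relational_attention(h_t, f_src, f_trg, W_sk, W_tk, W_hk)

# "Value" view: one vector W_5S f_i + W_5T f_j per pair (i, j), shape (M, M, H).
pair_values = (f_src @ W_5s.T)[:, None, :] + (f_trg @ W_5t.T)[None, :, :]
attended_pairs = np.einsum("ij,ijh->h", alpha, pair_values)
# Same quantity computed from the two aggregated contexts, as in Eq. 28.
from_contexts = W_5s @ f_src_d + W_5t @ f_trg_d
print(np.allclose(attended_pairs, from_contexts))
```

The two computations agree because the value is linear in the pair of features, so the model never has to materialize the M x M x H tensor of pairwise values; only the M x M score matrix is needed at each step.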

4 Results 

To evaluate the performance of our relational speaker models (Sec. 3), we train them on all three datasets (Sec. 2). We evaluate our models based on both automatic metrics and pairwise human evaluation. We also show our generated examples for each dataset.

4.1 Experimental Setup 

We use the same hyperparameters when applying our model to the three datasets. Dimensions of hidden vectors are 512. The model is optimized by Adam with a learning rate of 1e-4. We add dropout layers of rate 0.5 everywhere to avoid over-fitting. When generating instructions for evaluation, we use maximum-decoding: the word $w_t$ generated at time step $t$ is $\arg\max_k p(w_{t,k})$. For the Spot-the-Diff dataset, we take the "Single sentence decoding" experiment as in Jhamtani and Berg-Kirkpatrick (2018). We also try different ways to mix the three datasets but we do not see improvement. We first train a unified model on the union of these datasets. The metrics drop a lot because the tasks and language domains (e.g., the word dictionary and lengths of sentences) are different from each other. We next only share the visual components to overcome the disagreement in language. However, the image domains are still quite different from each other (as shown in Fig. 2). Thus, we finally separately train three models on the three datasets with minimal cross-dataset modifications.
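The maximum-decoding rule described above (pick the most probable token at every step) can be sketched as a short loop. The decoder interface here is hypothetical: `step_fn`, the token ids, and the toy transition rule are assumptions used only to make the example runnable.

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Emit argmax_k p(w_{t,k}) at each step until EOS or max_len."""
    tokens, state, prev = [], None, bos_id
    for _ in range(max_len):
        probs, state = step_fn(prev, state)  # distribution over the vocabulary
        prev = int(np.argmax(probs))         # maximum-decoding choice
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens

def toy_step(prev, state):
    # Hypothetical stand-in for the speaker: favors token (prev + 1) mod 4.
    probs = np.full(4, 0.1)
    probs[(prev + 1) % 4] = 0.7
    return probs, state

decoded = greedy_decode(toy_step, bos_id=0, eos_id=3)
print(decoded)  # [1, 2]
```

In the real model, `step_fn` would wrap the LSTM update and the attention module, with `state` carrying the hidden and cell vectors between steps.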

4.2 Metric-Based Evaluation 

As shown in Table 2, we compare the performance of our models on all three datasets with various automated metrics. Results on the test sets are reported. Following the setup in Jhamtani and Berg-Kirkpatrick (2018), we take CIDEr (Vedantam et al., 2015) as the main metric in evaluating the Spot-the-Diff and NLVR2 datasets. However, CIDEr is known to up-weight unimportant details (Kilickaya et al., 2017; Liu et al., 2017b). In our dataset, we find that instructions generated from a small set of short phrases could get a high CIDEr score. We thus change the main metric of our dataset to METEOR (Banerjee and Lavie, 2005), which is manually verified to be aligned with human judgment on the validation set in our dataset. To avoid over-fitting, the model is early-stopped based on the main metric on the validation set. We also report the BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) scores.

The results on various datasets show the gradual improvement made by our novel neural components, which are designed to better describe the relationship between two images. Our full model improves significantly over the baseline. The improvement on the NLVR2 dataset is limited because the comparison of two images was not forced to be considered when generating instructions.

4.3 Human Evaluation and Qualitative Analysis

We conduct a pairwise human evaluation on our generated sentences, which is used in Celikyilmaz et al. (2018) and Pasunuru and Bansal (2017). Agarwala (2018) also shows that the pairwise comparison is better than scoring sentences individually. We randomly select 100 examples from the test set in each dataset and generate captions via our full speaker model. We ask users to choose the better instruction between the captions generated by our full model and the basic model, or alternatively indicate that the two captions are equal in quality. The Image Editing Request dataset is specifically annotated by the image editing expert. The winning rate of our full model (dynamic relational attention) versus the basic model is shown in Table 3. Our full model outperforms the basic model significantly. We also show positive and negative examples generated by our full model in Fig. 5. In our Image Editing Request corpus, the model was able to detect and describe the editing actions but it failed in handling arbitrarily complex editing actions. We keep these hard examples in our dataset to match real-world requirements and allow follow-up work to pursue the remaining challenges in this task. Our model is designed for non-localized relationships, thus we do not explicitly model the pixel-level differences; however, we still find that the model could learn these differences in the Spot-the-Diff dataset. Since the descriptions in Spot-the-Diff are relatively simple, the errors mostly come from wrong entities or undetected differences as shown in Fig. 5. Our model is also sensitive to the image contents as shown in the NLVR2 dataset.

                 Basic   Full   Both Good   Both Not
Ours (IEdit)       11     24        5          60
Spot-the-Diff      22     37        6          35
NLVR2              24     37       17          22

Table 3: Human evaluation on 100 examples. Image pair and two captions generated by our basic model and full model are shown to the user. The user chooses one from 'Basic' model wins, 'Full' model wins, 'Both Good', or 'Both Not'. Better model marked in bold font.

5 Related Work


In order to learn a robust captioning system, public datasets have been released for diverse tasks including single image captioning (Lin et al., 2014; Plummer et al., 2015; Krishna et al., 2017), video captioning (Xu et al., 2016), referring expressions (Kazemzadeh et al., 2014; Mao et al., 2016), and visual question answering (Antol et al., 2015; Zhu et al., 2016; Johnson et al., 2017). In terms of model progress, recent years witnessed strong research progress in generating natural language sentences to describe visual contents, such as Vinyals et al. (2015); Xu et al. (2015); Ranzato et al. (2016); Anderson et al. (2018) in single image captioning, Venugopalan et al. (2015); Pan et al. (2016); Pasunuru and Bansal (2017) in video captioning, Mao et al. (2016); Liu et al. (2017a); Yu et al. (2017); Luo and Shakhnarovich (2017) in referring expressions, Jain et al. (2017); Li et al. (2018); Misra et al. (2018) in visual question generation, and Andreas and Klein (2016); Cohn-Gordon et al. (2018); Luo et al. (2018); Vedantam et al. (2017) in other setups.

Single image captioning is the most relevant problem to two-image captioning. Vinyals et al. (2015) created a powerful encoder-decoder (i.e., CNN to LSTM) framework for solving the captioning problem. Xu et al. (2015) further equipped it with an attention module to handle the memorylessness of fixed-size vectors. Ranzato et al. (2016) used reinforcement learning to eliminate exposure bias. Recently, Anderson et al. (2018) brought in information from an object detection system to further boost the performance.

Our model is built on the attentive encoder-decoder model (Xu et al., 2015), which is the same choice as in Jhamtani and Berg-Kirkpatrick (2018). We apply RL training with self-critical (Rennie et al., 2017) but do not see significant improvement, possibly because of the relatively small data amount compared to MS COCO. We also observe that the detection system in Anderson et al. (2018) has a high probability of failing on the three datasets, e.g., the detection system cannot detect the small cars and people in the Spot-the-Diff dataset. The DDLA (Difference Description with Latent Alignment) method proposed in Jhamtani and Berg-Kirkpatrick (2018) learns the alignment between descriptions and visual differences. It relies on the nature of the particular dataset and thus could not be easily transferred to other datasets where the visual relationship is not obvious. Two-image captioning could also be considered as a two-key-frame video captioning problem, and our sequential multi-head attention is a modified version of the seq-to-seq model (Venugopalan et al., 2015). Some existing work (Chen et al., 2018; Wang et al., 2018; Manjunatha et al., 2018) also learns how to modify images. These datasets and methods focus on the image colorization and adjustment tasks, while our dataset aims to study the general image editing request task.

[Figure 5 image: positive and negative example outputs for the three datasets (Image Editing Request, Spot-the-Diff, NLVR2). Generated sentences shown: "add a filter to the image"; "change the background to blue"; "the person in the white shirt is gone"; "the black car in the middle row is gone"; "there is a bookshelf with a white shelf in one of the images ."; "the left image shows a pair of shoes wearing a pair of shoes ."]

Figure 5: Examples of positive and negative results of our model from the three datasets. Selfies are blurred.

6 Conclusion 

In this paper, we explored the task of describing the visual relationship between two images. We collected the Image Editing Request dataset, which contains image pairs and human-annotated editing instructions. We designed novel relational speaker models and evaluated them on our collected dataset and other existing public datasets. Based on automatic and human evaluations, our relational speaker model improves the ability to capture visual relationships. For future work, we are going to further explore the possibility of merging the three datasets by either learning a joint image representation or by transferring domain-specific knowledge. We are also aiming to enlarge our Image Editing Request dataset with newly released posts on Reddit and Zhopped.